Distributions and non-parametric (distribution-free) tests

(Caveat: This list is sort of casual and may have some slight errors, especially in the descriptions of the non-parametric tests.)
COMMONLY USED DISTRIBUTIONS
binomial distribution
- distribution of possible event outcomes for varying numbers of trials with two complementary probabilities for occurrence and non-occurrence of the event, p and q (= 1 - p), such as for a coin flip where heads p = .5 and q = .5
- for example, the probability of flipping a coin three times and getting two heads and one tail (sketched in code below)
- includes other possible probabilities, for instance p = .75 and q = .25
- the Bernoulli distribution is the binomial distribution when the number of trials = 1, but this doesn't come up much in statistics for data analysis
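A quick sketch of that coin-flip example in Python, assuming scipy is available (the biased .75/.25 coin is just for illustration):

    # probability of exactly 2 heads in 3 flips of a fair coin
    from scipy.stats import binom
    print(binom.pmf(k=2, n=3, p=0.5))    # 3 * .5**2 * .5**1 = 0.375
    # same event with a biased coin, p = .75 and q = .25
    print(binom.pmf(k=2, n=3, p=0.75))   # 3 * .75**2 * .25**1 = 0.421875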
normal distribution
- the limit of the binomial distribution when the number of trials becomes infinite, as long as p isn't 0 or 1 (a rough numerical check is sketched below)
- describes outcomes of multiple causal factors that tend to cancel each other's effects, yielding many scores near the mean, but less frequently work together in the same direction to push scores higher or lower, yielding scores in the positive or negative tails
- gives the familiar bell-shaped curve for any mean μ and standard deviation σ (variance σ²)
- has skewness of 0 and kurtosis of 3 (ie neither leptokurtic nor platykurtic)
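A rough numerical check of that limit, assuming numpy and scipy; with n = 1,000 trials the binomial probabilities already sit very close to the normal curve:

    # compare the binomial pmf to the normal pdf with matching mean and SD
    import numpy as np
    from scipy.stats import binom, norm
    n, p = 1000, 0.5
    mu, sigma = n * p, np.sqrt(n * p * (1 - p))   # binomial mean and SD
    k = np.arange(450, 551)                       # values near the mean
    print(np.abs(binom.pmf(k, n, p) - norm.pdf(k, mu, sigma)).max())  # tiny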
standard normal distribution (z)
- normal distribution with μ = 0 and σ (and σ²) = 1
- describes distances of scores Y from the population mean μ in units of the population standard deviation σ, because z = (Y - μ)/σ (where μ and σ could include the subscript Y, as in μ_Y and σ_Y, but that's usually assumed)
- also describes distances of sample means M from the population mean μ in units of the population standard error of the mean (σ_M), because z = (M - μ_M)/σ_M
- μ_M is the mean of the sample means, ie the mean of the set of means of all the possible samples of a given size that could be taken from a population; it's equal to μ, the mean of the scores
- σ_M is the standard deviation of the sample means, ie the standard deviation of that same set of means of all the possible samples of a given size that could be taken from a population; it's equal to σ/√N (simulated in the sketch below)
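A simulation sketch of those last two bullets, assuming numpy (μ = 100, σ = 15, N = 25 are arbitrary choices):

    # sampling distribution of the mean: mu_M = mu, sigma_M = sigma/sqrt(N)
    import numpy as np
    rng = np.random.default_rng(0)
    mu, sigma, N = 100, 15, 25
    means = rng.normal(mu, sigma, size=(100_000, N)).mean(axis=1)
    print(means.mean())       # ~ 100, ie mu_M = mu
    print(means.std())        # ~ 3,   ie sigma_M = sigma/sqrt(N) = 15/5
    z = (means - mu) / (sigma / np.sqrt(N))   # z = (M - mu_M)/sigma_M
    print(z.mean(), z.std())  # ~ 0 and ~ 1: standard normal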
t distribution
- like the z distribution but used when the population standard error of the mean σ_M is unknown and therefore is estimated by the sample standard error of the mean s_M
- takes on different shapes with more spread-out, heavier tails depending on degrees of freedom (df = number of observations minus 1), because not just the sample mean M but also s_M varies from sample to sample
- with increasing and eventually infinite df (or pragmatically df > 120 or so), t more and more closely approximates z
- describes distances of sample means M from the population mean μ in units of the estimated standard error of the mean (s_M), because t = (M - μ_M)/s_M (checked against scipy in the sketch below)
- μ_M is, as above, the mean of the set of means of all the possible samples of a given size that could be taken from a population; it's equal to μ, the mean of the scores
- s_M is the standard deviation of the sample means as estimated from the sample itself, ie the estimated standard deviation of that same set of means of all the possible samples of a given size that could be taken from a population; it's equal to s/√N
- not limited to sample means but also used with any statistic (specifically, the statistic's distance from a hypothesized value) divided by its standard error; for instance with regression b-weights, t = (b - β)/s_b, where β is the population value of b, typically hypothesized to be 0 under the null hypothesis, and s_b is the estimated standard error of b, ie the standard deviation of all the values of b that would be obtained from every possible sample of a given size, as estimated from the sample itself
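A sketch of the t formula checked against scipy's built-in test (the sample and the hypothesized μ = 100 are made up):

    # one-sample t from the definition, compared with scipy
    import numpy as np
    from scipy import stats
    rng = np.random.default_rng(1)
    y = rng.normal(103, 15, size=20)            # N = 20 scores
    M, s, N = y.mean(), y.std(ddof=1), len(y)
    s_M = s / np.sqrt(N)                        # estimated standard error of the mean
    print((M - 100) / s_M)                      # t = (M - mu_M)/s_M
    print(stats.ttest_1samp(y, 100).statistic)  # same value
    # t approaches z as df grows: compare two-tailed .05 cutoffs
    print(stats.t.ppf(.975, 10), stats.t.ppf(.975, 120), stats.norm.ppf(.975))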
F distribution
- ratio of two chi square (χ²) values when each is divided by its df
- since χ² divided by its df describes the sampling distribution of a sample variance (apart from scaling by the population variance), F also represents the ratio of two sample variances (each estimating the same underlying population variance), which is how it's used in ANOVA
- characterized by two degrees of freedom values, for numerator and denominator df (since the numerator and denominator of the F ratio are each a variance with associated df)
- when numerator df = 1, F is equal to the square of the value of t on the denominator df (demonstrated below)
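A quick check of that last bullet, assuming scipy (df = 20 is arbitrary):

    # with numerator df = 1, F equals t squared on the denominator df
    from scipy import stats
    df = 20
    t_crit = stats.t.ppf(0.975, df)             # two-tailed t cutoff, alpha = .05
    f_crit = stats.f.ppf(0.95, dfn=1, dfd=df)   # F cutoff, alpha = .05
    print(t_crit ** 2, f_crit)                  # identical (~ 4.35)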
chi square (χ2) distribution
- describes possible values of the sum of a number of squared values randomly sampled from the z distribution
- degrees of freedom is the number of z scores squared and summed
- has different shapes with increasing degrees of freedom, going from positively skewed to more and more normally distributed
- mean of a given χ² distribution is its df; variance is twice its df (simulated below)
- typically used to evaluate "goodness of fit" of various calculable statistics, for instance with χ² statistics measuring discrepancies between observed and expected values in count data, and in contingency tables ("test of independence"); also used to evaluate model fit with the "deviance" statistic, for which a model's likelihood (L) is calculated, then changed to its logarithm (LL), switched from a negative to a positive sign (-LL), and multiplied by 2 (-2LL) to give "-2 log likelihood"; the difference in -2LL between nested models fits the χ² distribution
- usually pronounced "chi square" rather than "chi squared"; the symbol is the lowercase Greek letter chi, χ², not uppercase Χ²
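A simulation of the sum-of-squared-z's definition, plus a goodness-of-fit example (the die counts are invented):

    # chi square as a sum of df squared z scores; mean = df, variance = 2*df
    import numpy as np
    from scipy.stats import chisquare
    rng = np.random.default_rng(2)
    df = 5
    draws = (rng.standard_normal((100_000, df)) ** 2).sum(axis=1)
    print(draws.mean(), draws.var())            # ~ 5 and ~ 10
    # goodness of fit for count data: is this die fair?
    print(chisquare([18, 24, 16, 14, 29, 19]))  # expected counts default to equal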
Poisson distribution
- describes the probability of a discrete event occurring a given number of times in a given time interval or spatial location, when the average rate of occurrence is a known constant and the events occur independently of each other
- not a continuous distribution like z, t, F, and χ², but discrete like the binomial, giving a probability for the event occurring 0 times, 1 time, 2 times, etc., but obviously not for it occurring 2.5 times, which would be impossible
- for example, the number of meteors of a certain size hitting the Earth in a given year, or the number of customers arriving at a counter or calling in to a call center per hour, or the number of visits to an internet web site per minute, or the number of goals in a soccer game (sketched in code below), or the number of deaths per year for a given age group, or a patient's number of psychotic episodes in a month
- the logarithm of the event's expected frequency can be modeled using various predictors in Poisson regression, where the dependent variable Y fits the Poisson distribution
- Les poissons, les poissons, how I love les poissons -- love to chop and to serve little fish. First I cut off their heads, then I pull out their bones. Ah mais oui, ça c'est toujours dlish! Les poissons, les poissons (hee-hee-hee, hoh-hoh-hoh), with a cleaver I hack them in two; I pull out what's inside and I serve it up fried. God, I love little fishes, don't you?
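Back to statistics: a minimal sketch of the soccer-goals example, assuming (hypothetically) an average of 2.5 goals per game:

    # P(k goals in a game) when goals average 2.5 per game, independently
    from scipy.stats import poisson
    for k in range(6):
        print(k, poisson.pmf(k, mu=2.5))
    # discrete: defined at k = 0, 1, 2, ... but not at 2.5 goals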
COMMONLY USED NON-PARAMETRIC (DISTRIBUTION-FREE) TESTS
Kendall's tau correlation
Spearman rank correlation
Kendall's W measure of interrater agreement across rankings (coefficient of concordance)
Cohen's Kappa categorical measure of interrater agreement
Kolmogorov-Smirnov one sample or two independent sample distribution comparison (analogous to independent measures t-test)
Mann-Whitney U for two samples (analogous to independent measures t-test)
Wilcoxon signed rank test for dependent / paired / matched samples (analogous to paired samples t-test)
Kruskal-Wallis one-way ANOVA using ranks
Friedman two-way ANOVA using ranks, also for repeated measures ANOVA
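Most of these have implementations in scipy.stats; a quick sketch with made-up data:

    # a few of the tests above, run on invented numbers
    from scipy import stats
    x = [12, 15, 9, 20, 17, 11]
    y = [14, 10, 8, 16, 9, 7]
    print(stats.kendalltau(x, y))     # Kendall's tau correlation
    print(stats.spearmanr(x, y))      # Spearman rank correlation
    print(stats.ks_2samp(x, y))       # Kolmogorov-Smirnov, two samples
    print(stats.mannwhitneyu(x, y))   # Mann-Whitney U, independent samples
    print(stats.wilcoxon(x, y))       # Wilcoxon signed rank, paired samples
    print(stats.kruskal(x, y))        # Kruskal-Wallis one-way ANOVA on ranks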